The purpose of this project is to generate a model that will be able to predict the price of an electric vehicle based on the features that vehicle comes with.
An electric car is a car that does not run on a petrol engine but rather runs on an electric battery. Some benefits to owning an electric car are that the car is quieter, no exhaust emissions, and low maintenance costs. Electric cars are becoming increasingly popular, due to the rise of gas prices, and the advancement in technology. The problem with electric cars before was that they were too expensive, they couldn’t travel far, and there weren’t many places to charge your vehicle. Most electric cars do not suffer from these problems anymore. Therefore, there is more reason to purchase an electric car.
embed_youtube("FDTf2d47P4A")
This model can be helpful by helping customers choose the features they want in their car and getting an estimated price based on what they chose. Another way they can use the model is if they have a specific budget for their electric car, then they can see what options are available to them.
electric_cars <- read.csv("Data/ElectricCarData_Clean.csv")
head(electric_cars)
## Brand Model AccelSec TopSpeed_KmH Range_Km
## 1 Tesla Model 3 Long Range Dual Motor 4.6 233 450
## 2 Volkswagen ID.3 Pure 10.0 160 270
## 3 Polestar 2 4.7 210 400
## 4 BMW iX3 6.8 180 360
## 5 Honda e 9.5 145 170
## 6 Lucid Air 2.8 250 610
## Efficiency_WhKm FastCharge_KmH RapidCharge PowerTrain PlugType BodyStyle
## 1 161 940 Yes AWD Type 2 CCS Sedan
## 2 167 250 Yes RWD Type 2 CCS Hatchback
## 3 181 620 Yes AWD Type 2 CCS Liftback
## 4 206 560 Yes RWD Type 2 CCS SUV
## 5 168 190 Yes RWD Type 2 CCS Hatchback
## 6 180 620 Yes AWD Type 2 CCS Sedan
## Segment Seats PriceEuro
## 1 D 5 55480
## 2 C 5 30000
## 3 D 5 56440
## 4 D 5 68040
## 5 B 4 32997
## 6 F 5 105000
is.numeric(electric_cars$FastCharge_KmH)
## [1] FALSE
electric_cars$FastCharge_KmH <- as.numeric(electric_cars$FastCharge_KmH)
electric_cars <- electric_cars %>%
clean_names()
head(electric_cars)
## brand model accel_sec top_speed_km_h range_km
## 1 Tesla Model 3 Long Range Dual Motor 4.6 233 450
## 2 Volkswagen ID.3 Pure 10.0 160 270
## 3 Polestar 2 4.7 210 400
## 4 BMW iX3 6.8 180 360
## 5 Honda e 9.5 145 170
## 6 Lucid Air 2.8 250 610
## efficiency_wh_km fast_charge_km_h rapid_charge power_train plug_type
## 1 161 940 Yes AWD Type 2 CCS
## 2 167 250 Yes RWD Type 2 CCS
## 3 181 620 Yes AWD Type 2 CCS
## 4 206 560 Yes RWD Type 2 CCS
## 5 168 190 Yes RWD Type 2 CCS
## 6 180 620 Yes AWD Type 2 CCS
## body_style segment seats price_euro
## 1 Sedan D 5 55480
## 2 Hatchback C 5 30000
## 3 Liftback D 5 56440
## 4 SUV D 5 68040
## 5 Hatchback B 4 32997
## 6 Sedan F 5 105000
is.numeric(electric_cars$fast_charge_km_h)
## [1] TRUE
I have cleaned the names and also made the variable “FastCharge_KmH” numerical instead of categorical because the values in the dataset are the charging rate of the charger from the vehicle. Therefore I believe those values should be numerical, to help improve the model’s calculations.
The dataset I am using is from the website Kaggle. This dataset
contains some variables that are important to a customer looking to buy
an electric vehicle. (Can also be found in codebook) The variables in
this dataset are as follows - Brand: The brand of the
vehicle - Model: The model of the vehicle -
AccelSec: The acceleration of the vehicle from 0-100 (km/h)
- TopSpeed_KmH: The top speed of the vehicle (km/h) -
Range_Km: The range of the vehicle on a full charge (km) -
Efficiency_WhKm: The efficiency of the electric vehicle in
watts per hour per kilometer (wh/km) - FastCharge_KmH: The
charging rate of an electric car in kilometer per hour (km/h) -
RapidCharge: If the car has the option of rapidcharge.
Rapidcharge is a faster charge rate than the typical charging rate -
PowerTrain: If the car has front, rear, or all-wheel drive
- PlugType: The type of plug the electric vehicle uses -
BodyStyle: The type of body style or car style the vehicle
is - Segment: The market segment the car is in -
Seats: The amount of seats the vehicle holds -
PriceEuro: The price of the vehicle before any tax
incentives
Factors that could result in a person’s purchase
As a person who’s family currently owns an electric car, there are some key aspects that are important to a customer when finding the right electric car for them. Depending on a person’s environment, needs, and wants, will help decide on what electric car would be perfect for them.
The acceleration of a vehicle could be important to customers who live in an environment with many hills and inclines. Another reason why a customer may be interested in a higher acceleration car is if they want a more “exciting” drive. A car with high acceleration will give the vehicle a more sportier feel.
The top speed of a vehicle could be important to a customer if they have a place where they could take advantage of the car’s top speed. A place could be the Autobahn. The Autobahn is a freeway in Germany does not have any speed limits in some area. So a high top speed vehicle would benefit here.
Fig 1. Autobahn, map of freeway in Germany
Top Speed and Acceleration usually seem to have some type of correlation with each other in gas powered vehicles. Usually if a car has a high top speed, they usually have a lower 0-100kmh time.
plot(electric_cars$top_speed_km_h,electric_cars$accel_sec, pch = 19, col = "lightblue", xlab="Top Speed (Kmh)", ylab = "Acceleration (0-100 kmh)", main = "Correlation between Acceleration and Top Speed", xlim = c(0,450), ylim = c(0, 25),)
# Regression line
abline(lm(electric_cars$accel_sec ~ electric_cars$top_speed_km_h), col = "red", lwd = 3)
# Pearson correlation
text(paste("Correlation:", round(cor(electric_cars$top_speed_km_h, electric_cars$accel_sec), 2)), x = 350, y = 20)
Based on the graph we can see that acceleration and top speed have a relatively strong negative correlation. Acceleration and Top Speed have a correlation of -0.79. From this graph we can say that the higher a top speed an electric car has, the lower 0-100kmh time.
embed_youtube("xnYPQjAHrQ8")
Out of all the variables of the electric car, I believe this is to be the most important variable of them all. Range tells how far the vehicle can travel before needing to be recharged. Having a shorter range on car can significantly decrease the amount of places the car can travel and how long it will take to get to each destination (due to having to recharge more often). At this period of time (2022), there are not as many charging stations as gas stations. Therefore, it is very important to know where all the nearest charging stations are when traveling, or else it will result in the car running out of electricity.
Going off charging stations, a faster charger would benefit. The benefit of gas cars over electric cars are that for a gas powered car, it takes a few minutes to refill. This is compared to electric cars that may take up to an hour or more to be completely charged. However, some cars have fast charging or even better rapid charging. This can get the electric cars to 80% charge in just 10-15 minutes.
ggplot(electric_cars, aes(x=range_km, y=fast_charge_km_h, colour = rapid_charge)) + geom_point() + facet_wrap( ~ rapid_charge)
sum(is.na(electric_cars$fast_charge_km_h))
## [1] 5
electric_cars %>%
ggplot(aes(x = rapid_charge, colour = rapid_charge)) + geom_bar()
Based on our graphs we can see that there are no observations in this dataset that do not have a rapid charge. However, when I use to check if there are any NA’s in the column for fast_charge_km_h, it says there are 5. This is because the vehicles with no rapid charging are labeled as NA in the fast_charge_km_h column. We can conclude that if the vehicle does not have rapid charging, then the vehicle cannot fast charge. These cars will most likely not sell as well as the vehicles with rapid charge.
electric_cars %>%
ggplot(aes(x = rapid_charge, y = range_km, colour = rapid_charge)) + geom_boxplot()
We can see that the cars with no rapid charge do not have a high range. The cars that do have rapid charge have a longer range and a faster charge rate. There also seems to be a positive correlation between range and fast charging. We can conclude that vehicles without rapid charging do not have a far range. We can also conclude that the faster the charger you have, the farther range your electric vehicle will have.
Of course price is truly important to customers. An electric car can contain all these cool features, but if the price is completely overpriced then it would not be worth it. Price is a key factor for everything we purchase. We buy things that we feel is worth the price. No matter how good an object it is, we always end up buying that item if the price is reasonable. This is the reason why I decided to make price my response variable in this project.
ggplot(electric_cars, aes(price_euro)) +
geom_histogram(bins = 70, color = "red") +
labs(title = "Range of Prices")
Based on the plot we can say that the price of the electric vehicles from this dataset, mainly ranges from 0 to 80,000. The vehicles that are above this range are most likely luxurious electric vehicles or high-end sport electric sport cars. The cars that are within this range are most likely efficiently friendly vehicles.
A brand may not seem as important when looking for the perfect electric vehicle for you, but to others it may be something well considering. A few reasons why a customer may want to go for a specific brand are: the brand has a great reputation, the buyer may have bought vehicles from the brand before, or the brand gives it’s buyer a type of prestige. A brand that is currently leading the electric vehicle market is Tesla. This is simply because Tesla has a premium-like quality, with the addition of multiple high-tech features such as “self-driving”.
Fig 2. Brand: Tesla, Model: Model S
electric_cars %>%
ggplot(aes(reorder(brand, price_euro), price_euro)) + geom_boxplot(varwidth = TRUE) + coord_flip() + labs(subtitle = "brands",x = NULL)
Based on our plot we can see that the top 3 brands are Lightyear, Porsche, and Lucid. Lucid an Lightyear only make electric cars so they are not as well known. Porsche is a well-known car brand for making gasoline powered vehicles. There vehicles cost more than the average vehicle so it makes sense that Porsche is third on our plot. Tesla arrives at fifth which also makes sense since they are vehicles are closer to the more luxurious side. The bottom 3 are Sono, Smart, and SEAT. Sono and SEAT are not as well known in the automotive world. However, Smart makes small eco-friendly vehicles that are cheaper than the average vehicle therefore it makes sense that they are one of the lowest on the list.
Fig 3. Brand: Smart, Model:fortwo Electric Drive
We have found some correlations already between some variables. However, let us graph a correlation plot with the numeric predictors to see if they are any more correlations.
cor_ec <- electric_cars %>%
select_if(is.numeric) %>%
correlate()
rplot(cor_ec)
cor_ec %>%
stretch() %>%
ggplot(aes(x,y,fill = r)) + geom_tile() + geom_text(aes(label = as.character(fashion(r))))
Based on the correlation plot we can see that accel_sec has strong negative correlations with the variables: top_speed_km_h, range_km, price_euro. Top_speed_km_h has strong positive correlations with the variables: range_km, price_euro. The last correlation that is relative noting is the positive correlation between range_km and price_euro.
The following models conducted were done in this order and procedure: 1. Building the model 2. Running the model 3. Making predictions using the model 4. Assessing it’s accuracy on the testing set
electric_cars <- electric_cars %>%
mutate(brand = factor(brand)) %>%
mutate(rapid_charge = factor(rapid_charge)) %>%
mutate(power_train = factor(power_train)) %>%
mutate(plug_type = factor(plug_type)) %>%
mutate(body_style = factor(body_style)) %>%
mutate(segment = factor(segment))
head(electric_cars)
## brand model accel_sec top_speed_km_h range_km
## 1 Tesla Model 3 Long Range Dual Motor 4.6 233 450
## 2 Volkswagen ID.3 Pure 10.0 160 270
## 3 Polestar 2 4.7 210 400
## 4 BMW iX3 6.8 180 360
## 5 Honda e 9.5 145 170
## 6 Lucid Air 2.8 250 610
## efficiency_wh_km fast_charge_km_h rapid_charge power_train plug_type
## 1 161 940 Yes AWD Type 2 CCS
## 2 167 250 Yes RWD Type 2 CCS
## 3 181 620 Yes AWD Type 2 CCS
## 4 206 560 Yes RWD Type 2 CCS
## 5 168 190 Yes RWD Type 2 CCS
## 6 180 620 Yes AWD Type 2 CCS
## body_style segment seats price_euro
## 1 Sedan D 5 55480
## 2 Hatchback C 5 30000
## 3 Liftback D 5 56440
## 4 SUV D 5 68040
## 5 Hatchback B 4 32997
## 6 Sedan F 5 105000
I mutated the variables by factoring the numerical predictors.
set.seed(12)
ec_split <- electric_cars %>%
initial_split(prop = 0.75, strata = "price_euro")
ec_train <- training(ec_split)
dim(ec_train) #75
## [1] 75 14
ec_test <- testing(ec_split)
dim(ec_test) #28
## [1] 28 14
I decided to split the data with a .75 proportion and I made the strata price_euro because we are trying to predict the price of a vehicle based on the other variables. After splitting the data we get 75 observations for the training set and 28 left over for the testing set.
ec_recipe <- recipe(price_euro ~ accel_sec + top_speed_km_h + range_km + efficiency_wh_km + fast_charge_km_h + rapid_charge + power_train + plug_type + body_style + segment + seats, data = ec_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_normalize(all_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_zv(all_nominal_predictors())
ec_folds <- vfold_cv(ec_train, strata = price_euro, v = 10, repeats = 5)
We made a recipe using the training set. The predictor variables we left out of the recipe are brand and model. I decided to leave them out because I wanted the prediction to be completely based on the features of the vehicle. I step_dummy() all nominal predictors to encode them as categorical predictors. I also step_normalize() to center and scale all the predictors. I step_novel and step_zv all nominal predictors so it would assign any previously unseen factor level to a new value and to remove any variables that contain only a single value.
ridge_spec <- linear_reg(penalty = tune(), mixture = 0) %>%
set_mode("regression") %>%
set_engine("glmnet")
ridge_workflow <- workflow() %>%
add_recipe(ec_recipe) %>%
add_model(ridge_spec)
set.seed(12)
penalty_grid <- grid_regular(penalty(range = c(1, 11)), levels = 50)
penalty_grid
## # A tibble: 50 x 1
## penalty
## <dbl>
## 1 10
## 2 16.0
## 3 25.6
## 4 40.9
## 5 65.5
## 6 105.
## 7 168.
## 8 268.
## 9 429.
## 10 687.
## # ... with 40 more rows
tune_res <- tune_grid(
ridge_workflow,
resamples = ec_folds,
grid = penalty_grid
)
tune_res
## # Tuning results
## # 10-fold cross-validation repeated 5 times using stratification
## # A tibble: 50 x 5
## splits id id2 .metrics .notes
## <list> <chr> <chr> <list> <list>
## 1 <split [66/9]> Repeat1 Fold01 <tibble [100 x 5]> <tibble [1 x 3]>
## 2 <split [66/9]> Repeat1 Fold02 <tibble [100 x 5]> <tibble [1 x 3]>
## 3 <split [66/9]> Repeat1 Fold03 <tibble [100 x 5]> <tibble [1 x 3]>
## 4 <split [66/9]> Repeat1 Fold04 <tibble [100 x 5]> <tibble [1 x 3]>
## 5 <split [66/9]> Repeat1 Fold05 <tibble [100 x 5]> <tibble [1 x 3]>
## 6 <split [69/6]> Repeat1 Fold06 <tibble [100 x 5]> <tibble [1 x 3]>
## 7 <split [69/6]> Repeat1 Fold07 <tibble [100 x 5]> <tibble [1 x 3]>
## 8 <split [69/6]> Repeat1 Fold08 <tibble [100 x 5]> <tibble [1 x 3]>
## 9 <split [69/6]> Repeat1 Fold09 <tibble [100 x 5]> <tibble [1 x 3]>
## 10 <split [69/6]> Repeat1 Fold10 <tibble [100 x 5]> <tibble [1 x 3]>
## # ... with 40 more rows
##
## There were issues with some computations:
##
## - Warning(s) x50: A correlation computation is required, but `estimate` is constant...
##
## Use `collect_notes(object)` for more information.
autoplot(tune_res)
In this step we are adding the recipe to the ridge model. We are also making the workflow and grid for the tune_grid. We use the folds we did earlier for the tune_grid also.
collect_metrics(tune_res) %>%
arrange(-mean) %>%
head()
## # A tibble: 6 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 33932218. rmse standard 33289. 50 1992. Preprocessor1_Model33
## 2 54286754. rmse standard 33289. 50 1992. Preprocessor1_Model34
## 3 86851137. rmse standard 33289. 50 1992. Preprocessor1_Model35
## 4 138949549. rmse standard 33289. 50 1992. Preprocessor1_Model36
## 5 222299648. rmse standard 33289. 50 1992. Preprocessor1_Model37
## 6 355648031. rmse standard 33289. 50 1992. Preprocessor1_Model38
We collect the metrics of our regression tune and look at the mean and standard error.
best_penalty <- select_best(tune_res, metric = "rsq")
best_penalty
## # A tibble: 1 x 2
## penalty .config
## <dbl> <chr>
## 1 75431. Preprocessor1_Model20
ridge_final <- finalize_workflow(ridge_workflow, best_penalty)
ridge_final_fit <- fit(ridge_final, data = ec_train)
Ridge_Prediction <- predict(ridge_final_fit, new_data = ec_test %>% dplyr::select(-price_euro))
Ridge_Prediction <- bind_cols(Ridge_Prediction, ec_test %>% dplyr::select(price_euro))
Ridge_Graph <- Ridge_Prediction %>%
ggplot(aes(x=.pred, y=price_euro)) + geom_point(alpha = 1) + geom_abline(lty = 2) + theme_bw() + coord_obs_pred()
Ridge_Accuracy <- augment(ridge_final_fit, new_data = ec_test) %>%
rsq(truth = price_euro, estimate = .pred)
Here we prepare the predictions, graphs, and plots, for comparison at the end.
lasso_spec <-
linear_reg(penalty = tune(), mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
lasso_workflow <- workflow() %>%
add_recipe(ec_recipe) %>%
add_model(lasso_spec)
set.seed(12)
tune_res_lasso <- tune_grid(
lasso_workflow,
resamples = ec_folds,
grid = penalty_grid
)
tune_res_lasso
## # Tuning results
## # 10-fold cross-validation repeated 5 times using stratification
## # A tibble: 50 x 5
## splits id id2 .metrics .notes
## <list> <chr> <chr> <list> <list>
## 1 <split [66/9]> Repeat1 Fold01 <tibble [100 x 5]> <tibble [1 x 3]>
## 2 <split [66/9]> Repeat1 Fold02 <tibble [100 x 5]> <tibble [1 x 3]>
## 3 <split [66/9]> Repeat1 Fold03 <tibble [100 x 5]> <tibble [1 x 3]>
## 4 <split [66/9]> Repeat1 Fold04 <tibble [100 x 5]> <tibble [1 x 3]>
## 5 <split [66/9]> Repeat1 Fold05 <tibble [100 x 5]> <tibble [1 x 3]>
## 6 <split [69/6]> Repeat1 Fold06 <tibble [100 x 5]> <tibble [1 x 3]>
## 7 <split [69/6]> Repeat1 Fold07 <tibble [100 x 5]> <tibble [1 x 3]>
## 8 <split [69/6]> Repeat1 Fold08 <tibble [100 x 5]> <tibble [1 x 3]>
## 9 <split [69/6]> Repeat1 Fold09 <tibble [100 x 5]> <tibble [1 x 3]>
## 10 <split [69/6]> Repeat1 Fold10 <tibble [100 x 5]> <tibble [1 x 3]>
## # ... with 40 more rows
##
## There were issues with some computations:
##
## - Warning(s) x50: A correlation computation is required, but `estimate` is constant...
##
## Use `collect_notes(object)` for more information.
autoplot(tune_res_lasso)
In this step we are adding the recipe to the lasso model. We are also making the workflow and grid for the tune_grid. We use the folds we did earlier for the tune_grid also. The plots does not seem to be as smooth as the ridge model.
collect_metrics(tune_res_lasso) %>%
arrange(-mean) %>%
head()
## # A tibble: 6 x 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 47149. rmse standard 33289. 50 1992. Preprocessor1_Model19
## 2 75431. rmse standard 33289. 50 1992. Preprocessor1_Model20
## 3 120679. rmse standard 33289. 50 1992. Preprocessor1_Model21
## 4 193070. rmse standard 33289. 50 1992. Preprocessor1_Model22
## 5 308884. rmse standard 33289. 50 1992. Preprocessor1_Model23
## 6 494171. rmse standard 33289. 50 1992. Preprocessor1_Model24
We collect the metrics of our regression tune and look at the mean and standard error. They seem to be very similar and or not the same as the Ridge regression.
best_penalty_lasso <- select_best(tune_res_lasso, metric = "rsq")
lasso_final <- finalize_workflow(lasso_workflow, best_penalty_lasso)
lasso_final_fit <- fit(lasso_final, data = ec_train)
Lasso_Prediction <- predict(lasso_final_fit, new_data = ec_test %>% dplyr::select(-price_euro))
Lasso_Prediction <- bind_cols(Lasso_Prediction, ec_test %>% dplyr::select(price_euro))
Lasso_Graph <- Lasso_Prediction %>%
ggplot(aes(x=.pred, y=price_euro)) + geom_point(alpha=1) + geom_abline(lty = 2) + theme_bw() + coord_obs_pred()
Lasso_Accuracy <- augment(lasso_final_fit, new_data = ec_test) %>%
rsq(truth = price_euro, estimate = .pred)
Here we prepare the predictions, graphs, and plots, for comparison at the end.
boost_spec <- boost_tree() %>%
set_engine("xgboost") %>%
set_mode("regression")
boost_wf <- workflow() %>%
add_model(boost_spec %>%
set_args(trees = tune())) %>%
add_recipe(ec_recipe)
set.seed(12)
boost_grid <- grid_regular(trees(range = c(10, 2000)), levels = 50)
boost_tune_res <- tune_grid(
boost_wf,
resamples = ec_folds,
grid = boost_grid,
)
autoplot(boost_tune_res)
In this step we are adding the recipe to the Boost model. We are also making the workflow and grid for the tune_grid. We use the folds we did earlier for the tune_grid also. The tree plots here seem to go straight up and then flatten out, never changing it’s slope again.
collect_metrics(boost_tune_res) %>%
arrange(-mean) %>%
head()
## # A tibble: 6 x 7
## trees .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 50 rmse standard 13476. 30 2180. Preprocessor1_Model02
## 2 91 rmse standard 13475. 30 2180. Preprocessor1_Model03
## 3 131 rmse standard 13474. 30 2180. Preprocessor1_Model04
## 4 172 rmse standard 13474. 30 2180. Preprocessor1_Model05
## 5 213 rmse standard 13474. 30 2180. Preprocessor1_Model06
## 6 253 rmse standard 13474. 30 2180. Preprocessor1_Model07
We collect the metrics of our regression tune and look at the mean and standard error. The boost model seems to have a much lower mean compared to Lasso and Ridge. The boost also has a higher standard error compared to the two previous models.
best_boost_final <- select_best(boost_tune_res)
best_boost_final_model <- finalize_workflow(boost_wf, best_boost_final)
best_boost_final_model_fit <- fit(best_boost_final_model, data = ec_train)
Boost_Prediction <- predict(best_boost_final_model_fit, new_data = ec_test %>% dplyr::select(-price_euro))
Boost_Prediction <- bind_cols(Boost_Prediction, ec_test %>% dplyr::select(price_euro))
Boost_Graph <- Boost_Prediction %>%
ggplot(aes(x=.pred, y=price_euro)) + geom_point(alpha=1) + geom_abline(lty = 2) + theme_bw() + coord_obs_pred()
Boost_Accuracy <- augment(best_boost_final_model_fit, new_data = ec_test) %>%
rsq(truth = price_euro, estimate = .pred)
Here we prepare the predictions, graphs, and plots, for comparison at the end.
tree_spec <-decision_tree() %>%
set_engine("rpart")
class_tree_spec <- tree_spec %>%
set_mode("regression")
class_tree_wf <- workflow() %>%
add_model(class_tree_spec %>% set_args(cost_complexity = tune())) %>%
add_recipe(ec_recipe)
set.seed(12)
param_grid <- grid_regular(cost_complexity(range = c(-5, 5)), levels = 50)
tune_res_tree <- tune_grid(
class_tree_wf,
resamples = ec_folds,
grid = param_grid,
)
autoplot(tune_res_tree)
In this step we are adding the recipe to the Tree model. We are also making the workflow and grid for the tune_grid. We use the folds we did earlier for the tune_grid also. The cost_complexity parameter seems to have more similarities to the lasso plots. The reasoning is the plots are not smooth, and seem to have sudden changes of slope.
collect_metrics(tune_res_tree) %>%
arrange(-mean) %>%
head()
## # A tibble: 6 x 7
## cost_complexity .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.791 rmse standard 33289. 50 1992. Preprocessor1_Model25
## 2 1.26 rmse standard 33289. 50 1992. Preprocessor1_Model26
## 3 2.02 rmse standard 33289. 50 1992. Preprocessor1_Model27
## 4 3.24 rmse standard 33289. 50 1992. Preprocessor1_Model28
## 5 5.18 rmse standard 33289. 50 1992. Preprocessor1_Model29
## 6 8.29 rmse standard 33289. 50 1992. Preprocessor1_Model30
We collect the metrics of our regression tune and look at the mean and standard error. The tree model has the exact same mean and standard error as the Lasso and Ridge models.
best_complexity <- select_best(tune_res_tree)
class_tree_final <- finalize_workflow(class_tree_wf, best_complexity)
class_tree_final_fit <- fit(class_tree_final, data = ec_train)
class_tree_final_fit %>%
extract_fit_engine() %>%
rpart.plot()
The tree plot asks specific questions. These questions can only be answered yes or no. Looking at the tree plot we can see that the first question was about the accel_sec. The second line asked about the range_km. The third line asked about the accel_sec again or it asked about segment_c. Then it finally asked again about accel_sec. Going through this tree gave predictions of the car price. Looking at the plot, the model predicts: - 9% \(\approx\) 25,000 - 28% \(\approx\) 34,000 - 15% \(\approx\) 45,000 - 9% \(\approx\) 52,000 - 19% \(\approx\) 64,000 - 20% \(\approx\) 108,000
Tree_Prediction <- predict(class_tree_final_fit, new_data = ec_test %>% dplyr::select(-price_euro))
Tree_Prediction <- bind_cols(Tree_Prediction, ec_test %>% dplyr::select(price_euro))
Tree_Graph <- Tree_Prediction %>%
ggplot(aes(x=.pred, y=price_euro)) + geom_point(alpha=1) + geom_abline(lty = 2) + theme_bw() + coord_obs_pred()
Tree_Accuracy <- augment(class_tree_final_fit, new_data = ec_test) %>%
rsq(truth = price_euro, estimate = .pred)
Here we prepare the predictions, graphs, and plots, for comparison at the end.
Comparison of the four different models: We will compare the four different models in this by these factors: - Predictions - Graphs - Accuracy Whichever model has the highest accuracy, would be the best model to predict the price of an electric car based on their features.
We are going to look over the first five predictions compared to the actual price of the vehicle. Note: The actual prices are the same for each model. Ridge
head(Ridge_Prediction)
## # A tibble: 6 x 2
## .pred price_euro
## <dbl> <int>
## 1 76569. 55480
## 2 40633. 30000
## 3 55064. 68040
## 4 36748. 31900
## 5 65309. 46380
## 6 48400. 40795
Lasso
head(Lasso_Prediction)
## # A tibble: 6 x 2
## .pred price_euro
## <dbl> <int>
## 1 65651. 55480
## 2 42570. 30000
## 3 52921. 68040
## 4 36030. 31900
## 5 58694. 46380
## 6 46439. 40795
Boost
head(Boost_Prediction)
## # A tibble: 6 x 2
## .pred price_euro
## <dbl> <int>
## 1 51698. 55480
## 2 33604. 30000
## 3 55283. 68040
## 4 33153. 31900
## 5 39800. 46380
## 6 35191. 40795
Tree
head(Tree_Prediction)
## # A tibble: 6 x 2
## .pred price_euro
## <dbl> <int>
## 1 108442. 55480
## 2 33732. 30000
## 3 64206. 68040
## 4 33732. 31900
## 5 44868. 46380
## 6 64206. 40795
Looking at the predictions of each model, I would personally say that the boost model has the best predictions. I think it’s predictions overall are closer to the actual value of the vehicle.
figure <- ggarrange(Ridge_Graph, Lasso_Graph, Boost_Graph,Tree_Graph,
labels = c("Ridge", "Lasso", "Boost","Tree"),
ncol = 2, nrow = 2)
figure
In the plots the dotted line represents where the points would be if the actual price of the vehicle was the same number as the prediction. Looking at the plots I can say I would say that the Lasso and Ridge have the points closest to the dotted line meaning they most likely have the highest accuracy between the four models.
accuracy_comparisons <- bind_rows(Ridge_Accuracy, Lasso_Accuracy, Boost_Accuracy, Tree_Accuracy) %>%
tibble() %>% mutate(model = c("Ridge", "Lasso", "Boost", "Tree")) %>%
dplyr::select(model, .estimate) %>%
arrange(.estimate)
accuracy_comparisons
## # A tibble: 4 x 2
## model .estimate
## <chr> <dbl>
## 1 Tree 0.565
## 2 Boost 0.681
## 3 Ridge 0.815
## 4 Lasso 0.862
Looking at the accuracy of the four different models we see that the Lasso model had the highest accuracy and the tree model had the lowest accuracy. Ridge was after Lasso then Boost, then finally Tree.
Going through the dataset we found a couple of correlations between the variables. After we decided to produce four different models. These models were: Ridge, Lasso, Boost, and Tree. Analyzing the results of the models we have that Lasso had the highest predicted accuracy and Tree with the lowest predicted accuracy. I believe the reason why the tree model did poorly was because I think the tree model is more successful with categorical data. The data we used mainly consisted of numerical data. Based on this research, by using a Lasso model we can accurately predict the price of an electric vehicle about 86% of the time.